attention mechanism
Generalization as of 2018
$ \mathrm{Attention}(query, Keys, Values) = \mathrm{Normalize}(F(query, Keys)) \cdot Values
There is a query, and there are Keys, a collection of multiple keys.
There is a function F that takes a query and Keys as arguments and returns the intensity of attention for each key.
The results are then normalized in some way so that they sum to 1, giving the attention intensities (roughly softmax; see Hard attention mechanism). The Values are then averaged, weighted by their attention intensities.
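A minimal numpy sketch of this generalization (the function names and array shapes are my assumptions, not from any library):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values, F, normalize=softmax):
    """Normalize(F(query, Keys)) . Values

    query:  (d,)      one query vector
    keys:   (n, d)    n key vectors
    values: (n, d_v)  one value vector per key
    F:      returns one attention intensity per key, shape (n,)
    """
    scores = F(query, keys)       # intensity of attention for each key
    weights = normalize(scores)   # normalized to sum to 1
    return weights @ values       # weighted average of the Values

# example: inner product as F (see the dot-product variant below)
q = np.ones(4)
K = np.eye(4)
V = np.arange(8.0).reshape(4, 2)
out = attention(q, K, V, F=lambda q, K: K @ q)
```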
schematic
https://gyazo.com/211618e709ff284a379c5c2f502934da
F does not know the number of keys. $ F(query, Keys) does not depend on the shape of Keys.
I don't know how to express this in mathematical terms.
There is a function f that takes one query and one key, and F is [f(query, key) for key in Keys].
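A sketch of that construction; F_from_f is a hypothetical helper name:
```python
import numpy as np

def F_from_f(f):
    """Lift a pairwise score f(query, key) to F(query, Keys) =
    [f(query, key) for key in Keys]; the result's length tracks
    the number of keys, so F never needs to know it in advance."""
    def F(query, keys):
        return np.array([f(query, key) for key in keys])
    return F

# example: a pairwise inner product becomes a full scoring function
F = F_from_f(lambda q, k: q @ k)
```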
FFN := Feed-Forward Network
$ \mathrm{Attention}(query, Key, Value) = \mathrm{Softmax}(\mathrm{FFN}(\mathrm{concat}(query, Key))) \cdot Value
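A minimal sketch of the FFN variant (a one-hidden-layer tanh network and the sizes of W and w are assumptions on my part):
```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 8, 16                       # assumed sizes
W = rng.normal(size=(d_hidden, 2 * d))    # hidden layer over concat(query, key)
w = rng.normal(size=(d_hidden,))          # maps hidden layer to a scalar score

def ffn_attention(query, keys, values):
    """Softmax(FFN(concat(query, key)) for each key) . Values"""
    scores = np.array([w @ np.tanh(W @ np.concatenate([query, k]))
                       for k in keys])
    e = np.exp(scores - scores.max())     # softmax
    return (e / e.sum()) @ values
```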
By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
The hidden state of the RNN is a fixed-length vector, and having to pack the information of the entire sentence into it is a burden.
The attention mechanism can retrieve information from data of arbitrary length, which relieves that burden.
https://gyazo.com/dab69f04c581681e9c3c543b92633ef5
There is a school in which F is simply the inner product of query and key (dot-product attention).
$ \mathrm{Attention}(query, Key, Value) = \mathrm{Softmax}(query \cdot Key) \cdot Value
Of course, in some papers this inner product is written as a matrix product.
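A sketch of the dot-product variant; stacking the keys into a matrix turns all the inner products into one matrix-vector product:
```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Softmax(query . Key) . Value"""
    scores = keys @ query               # inner product with every key at once
    e = np.exp(scores - scores.max())   # softmax
    return (e / e.sum()) @ values
```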
https://gyazo.com/1db88d9a61dcce7af368f0d0e594b3f1
Initially, the attention mechanism was envisioned to be used in combination with RNNs
In the Encoder-Decoder configuration, store the Encoder's hidden states and let the attention mechanism select from among them.
In this configuration, Key and Value come from Encoder and query comes from Decoder.
This type of configuration is called [Source Target Attention]
K and V together are called Memory
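A sketch of this wiring, assuming dot-product scoring; the Encoder's hidden states serve as both Key and Value (the Memory), and the Decoder's hidden state is the query:
```python
import numpy as np

def source_target_attention(decoder_state, encoder_states):
    """query = Decoder hidden state; Key = Value = Encoder hidden states."""
    memory = encoder_states             # Key and Value together: the Memory
    scores = memory @ decoder_state     # dot-product scoring, one choice of F
    e = np.exp(scores - scores.max())
    context = (e / e.sum()) @ memory    # context vector returned to the Decoder
    return context
```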
The other type is the one where Key, Value, and query are all "self" ([Self-Attention])... no, that definition is not balanced in its level of abstraction...
It may eventually differentiate into better terms.
So far, one implementation example is "everything comes from the lower layers."
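A sketch of the "everything comes from the lower layers" case, assuming dot-product scoring; note that the number of positions n is free to vary:
```python
import numpy as np

def self_attention(X):
    """query, Key, and Value all come from the lower layer's output X.

    X: (n, d) -- n positions; n may differ from input to input.
    Returns (n, d): one attended vector per position.
    """
    scores = X @ X.T                               # every position scores every position
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)     # softmax over keys, per query
    return weights @ X
```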
In this form (everything coming from the lower layer), it's a development of the CNN: CNNs, which could only accept fixed-length input, were replaced by an attention mechanism that can accept variable-length input.
-----
Old commentary
This commentary implicitly assumes an RNN and is not a generalization.
Create a scalar that represents the appropriate intensity of attention from "the current hidden state and the past hidden states"
Normalize those scalars so that they sum to 1
Use them as the weights of a weighted average of the past hidden states
There is a way to use output layer values instead of past hidden states.
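The same three steps as a sketch (using a dot product for the scalar is my assumption):
```python
import numpy as np

def rnn_attention(current_h, past_hs):
    """Weighted average of past hidden states, weighted by attention."""
    scores = past_hs @ current_h        # scalar intensity per past hidden state
    e = np.exp(scores - scores.max())
    weights = e / e.sum()               # normalized to a total of 1
    return weights @ past_hs            # weighted average of the hidden states
```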
---
This page is auto-translated from /nishio/注意機構. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.